Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach

Authors

  • Jinfeng Yi
  • Rong Jin
  • Anil K. Jain
  • Shaili Jain
Abstract

Crowdsourcing utilizes human ability by distributing tasks to a large number of workers. It is especially suitable for solving data clustering problems because it provides a way to obtain a similarity measure between objects based on manual annotations, which capture the human perception of similarity among objects. This is in contrast to most clustering algorithms, which face the challenge of finding an appropriate similarity measure for the given dataset. Several algorithms have been developed for crowdclustering that combine partial clustering results, each derived from the annotations of a different worker, into a single data partition. However, because human annotations are noisy, existing crowdclustering approaches require a large number of them, leading to a high computational cost in addition to the large cost of annotation itself. We address this problem by developing a novel approach for crowdclustering that exploits the technique of matrix completion. Instead of using all the annotations, the proposed algorithm constructs a partially observed similarity matrix from the subset of pairwise annotation labels that most annotators agree on. It then applies a matrix completion algorithm to fill in the similarity matrix and obtains the final data partition by applying a spectral clustering algorithm to the completed matrix. We show, both theoretically and empirically, that the proposed approach needs only a small number of manual annotations to obtain an accurate data partition. In effect, we highlight the trade-off between a large number of noisy crowdsourced labels and a small number of high-quality labels.
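The three steps named in the abstract (filter pairwise labels by annotator agreement, complete the similarity matrix, cluster spectrally) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the function names, the agreement thresholds `hi` and `lo`, the use of `-1` to mark objects a worker did not see, the soft-impute style solver (iterative singular-value shrinkage) standing in for the paper's matrix completion algorithm, and the simplified spectral step that clusters the top eigenvectors of the completed matrix with a small k-means.

```python
import numpy as np

def filtered_similarity(worker_labels, hi=0.8, lo=0.2):
    """Build a partially observed 0/1 similarity matrix from per-worker
    partial clusterings. A label of -1 means the worker did not see the object.
    A pair is kept only when workers mostly agree (thresholds are assumptions)."""
    n = len(worker_labels[0])
    same = np.zeros((n, n))   # votes that a pair belongs together
    seen = np.zeros((n, n))   # number of workers who saw both objects
    for lab in worker_labels:
        lab = np.asarray(lab)
        present = lab >= 0
        both = np.outer(present, present)
        seen += both
        same += both & (lab[:, None] == lab[None, :])
    frac = np.divide(same, seen, out=np.full((n, n), 0.5), where=seen > 0)
    observed = (seen > 0) & ((frac >= hi) | (frac <= lo))
    A = np.where(frac >= hi, 1.0, 0.0)
    np.fill_diagonal(A, 1.0)
    np.fill_diagonal(observed, True)
    return A, observed

def soft_impute(A, observed, shrink=1.0, iters=200):
    """Low-rank completion by repeated SVD soft-thresholding
    (a stand-in solver, not necessarily the one used in the paper)."""
    X = np.zeros_like(A)
    for _ in range(iters):
        filled = np.where(observed, A, X)          # keep observed entries fixed
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        X = (U * np.maximum(s - shrink, 0.0)) @ Vt  # shrink singular values
    return X

def spectral_partition(S, k, iters=50):
    """Cluster rows of the top-k eigenvectors of S with a small k-means
    that uses deterministic farthest-point initialization."""
    S = (S + S.T) / 2
    _, vecs = np.linalg.eigh(S)                    # eigenvalues ascending
    E = vecs[:, -k:]
    E = E / np.maximum(np.linalg.norm(E, axis=1, keepdims=True), 1e-12)
    centers = [E[0]]
    for _ in range(1, k):
        d = np.min([np.sum((E - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(E[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        dist = ((E[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = np.argmin(dist, axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = E[assign == j].mean(axis=0)
    return assign

# Two hypothetical workers, each labeling four of six objects.
workers = [[0, 0, -1, 1, 1, -1],
           [-1, 0, 0, -1, 1, 1]]
A, observed = filtered_similarity(workers)
completed = soft_impute(A, observed)
labels = spectral_partition(completed, k=2)
```

In this toy input no worker sees objects 0 and 2 together, so their similarity is unobserved; the completion step is what fills that entry in before the spectral step runs.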


Similar resources

Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning

One of the main challenges in data clustering is to define an appropriate similarity measure between two objects. Crowdclustering addresses this challenge by defining the pairwise similarity based on the manual annotations obtained through crowdsourcing. Despite its encouraging results, a key limitation of crowdclustering is that it can only cluster objects when their manual annotations are ava...


Errata: Distant Supervision for Relation Extraction with Matrix Completion

The essence of distantly supervised relation extraction is that it is an incomplete multi-label classification problem with sparse and noisy features. To tackle the sparsity and noise challenges, we propose solving the classification problem using matrix completion on factorized matrix of minimized rank. We formulate relation classification as completing the unknown labels of testing items (ent...


Distant Supervision for Relation Extraction with Matrix Completion

The essence of distantly supervised relation extraction is that it is an incomplete multi-label classification problem with sparse and noisy features. To tackle the sparsity and noise challenges, we propose solving the classification problem using matrix completion on factorized matrix of minimized rank. We formulate relation classification as completing the unknown labels of testing items (ent...


Graph-Based Lexicon Expansion with Sparsity-Inducing Penalties

We present novel methods to construct compact natural language lexicons within a graph-based semi-supervised learning framework, an attractive platform suited for propagating soft labels onto new natural language types from seed data. To achieve compactness, we induce sparse measures at graph vertices by incorporating sparsity-inducing penalties in Gaussian and entropic pairwise Markov networks ...



Publication date: 2012